{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"table_of_content\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 1: Bag-of-Words Exercises\n",
    "\n",
    "In the following, we will convert our corpus of 10-k documents into bag-of-words, and convert them into document vectors using Term-Frequency-Inverse-Document-Frequency (tf-idf) re-weighting. Afterward, we will compute sentiments and similarity metrics. Since we will be reusing our notebook, so the various sections are linked below:\n",
    "\n",
    "1. <a href=\"#bag_of_word\">Compute bag-of-words </a>: implement `bag_of_words` that converts tokenized words into a dictionary of word-counts\n",
    "\n",
    "2. <a href=\"#sentiment\">Sentiments </a>: using wordlists, compute positive and negative sentiments from bag-of-words. Implement `get_sentiment`\n",
    "\n",
    "For solutions, see [bagofwords_solutions.py](./bagofwords_solutions.py). You can load the functions by simply calling \n",
    "\n",
    "`from bagofwords_solutions import *`\n",
    "\n",
    "# Part 2: Document-Vector Exercises\n",
    "\n",
    "3. <a href=\"#compute_idf\">Compute idf </a>: computing the inverse document frequency, implement `get_idf`\n",
    "\n",
    "4. <a href=\"#compute_tf\">Compute tf </a>: computing the term frequency, implement `get_tf`\n",
    "\n",
    "5. <a href=\"#doc_vector\">Document vector </a>: using the functions `get_idf` and `get_tf`, compute a word_vector for each document in the corpus\n",
    "6. <a href=\"#similarity\">Similarities </a>: comparing two vectors, and compute cosine and jacard similarity metrics. Implement `get_cos` and `get_jac`\n",
    "\n",
    "For solutions, see [bagofwords_solutions.py](./bagofwords_solutions.py). You can load the functions by simply calling \n",
    "\n",
    "`from bagofwords_solutions import *`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Initialization\n",
    "\n",
    "First we read in our 10-k documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download some excerpts from 10-K files\n",
    "from download10k import get_files\n",
    "\n",
    "CIK = {'ebay': '0001065088', 'apple':'0000320193', 'sears': '0001310067'}\n",
    "get_files(CIK['ebay'], 'EBAY')\n",
    "get_files(CIK['apple'], 'AAPL')\n",
    "get_files(CIK['sears'], 'SHLDQ')\n",
    "\n",
    "\n",
    "# get a list of all 10-ks in our directory\n",
    "files=! ls *10k*.txt\n",
    "print(\"10-k files: \",files)\n",
    "files = [open(f,\"r\").read() for f in files]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "here we define useful functions to tokenize the texts into words, remove stop-words, and lemmatize and stem our words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# for nice number printing\n",
    "np.set_printoptions(precision=3, suppress=True)\n",
    "\n",
    "# tokenize and clean the text\n",
    "import nltk\n",
    "from nltk.stem import WordNetLemmatizer, SnowballStemmer\n",
    "from collections import Counter\n",
    "from nltk.corpus import stopwords\n",
    "from nltk import word_tokenize\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "# tokenize anything that is not a number and not a symbol\n",
    "word_tokenizer = RegexpTokenizer(r'[^\\d\\W]+')\n",
    "\n",
    "nltk.download('stopwords')\n",
    "nltk.download('wordnet')\n",
    "\n",
    "\n",
    "sno = SnowballStemmer('english')\n",
    "wnl = WordNetLemmatizer()\n",
    "\n",
    "# get our list of stop_words\n",
    "nltk.download('stopwords')\n",
    "stop_words = set(stopwords.words('english')) \n",
    "# add some extra stopwords\n",
    "stop_words |= {\"may\", \"business\", \"company\", \"could\", \"service\", \"result\", \"product\", \n",
    "               \"operation\", \"include\", \"law\", \"tax\", \"change\", \"financial\", \"require\",\n",
    "               \"cost\", \"market\", \"also\", \"user\", \"plan\", \"actual\", \"cash\", \"other\",\n",
    "               \"thereto\", \"thereof\", \"therefore\"}\n",
    "\n",
    "# useful function to print a dictionary sorted by value (largest first by default)\n",
    "def print_sorted(d, ascending=False):\n",
    "    factor = 1 if ascending else -1\n",
    "    sorted_list = sorted(d.items(), key=lambda v: factor*v[1])\n",
    "    for i, v in sorted_list:\n",
    "        print(\"{}: {:.3f}\".format(i, v))\n",
    "\n",
    "# convert text into bag-of-words\n",
    "def clean_text(txt):\n",
    "    lemm_txt = [ wnl.lemmatize(wnl.lemmatize(w.lower(),'n'),'v') \\\n",
    "                for w in word_tokenizer.tokenize(txt) if \\\n",
    "                w.isalpha() and w not in stop_words ]\n",
    "    return [ sno.stem(w) for w in lemm_txt if w not in stop_words and len(w) > 2 ]\n",
    "\n",
    "corpus = [clean_text(f) for f in files]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"bag_of_words\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Bag-of-Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Implement a function that converts a list of tokenized words into bag-of-words, i.e. a dictionary that outputs the frequency count of a word\n",
    "\n",
    "** python already provide the `collections.Counter` class to perform this, but I encourage you to implement your own function as an exercise\n",
    "\n",
    "<a href=\"#table_of_content\">back to top</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "def bag_of_words(words):\n",
    "    # TO DO\n",
    "    pass\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"sentiment\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Sentiments\n",
    "Count the fraction of words that appear in a wordlist, returning a sentiment score between 0 and 1:\n",
    "\n",
    "$$\n",
    "\\textrm{score} = \\frac{\\textrm{Number of words in document matching wordlist}}{\\textrm{Number of words in document}}\n",
    "$$\n",
    "\n",
    "Implement the score in a function `get_sentiment(words, wordlist)`, where words is a list of words. Feel free to use the previous `bag_of_words` function. \n",
    "(*for extra challenge, try to code the function in one-line*)\n",
    "\n",
    "Here, I've included a positive and negative wordlist that I constructed by hand. Due to copyright issues, we are not able to provide other commonly used wordlists. I encourage you to try out different wordlists on your own.\n",
    "\n",
    "<a href=\"#table_of_content\">back to top</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load wordlist first\n",
    "import pickle\n",
    "\n",
    "with open('positive_words.pickle', 'rb') as f:\n",
    "    positive_words = pickle.load(f)\n",
    "    # also need to stem and lemmatize the text\n",
    "    positive_words = set(clean_text(\" \".join(positive_words)))\n",
    "    \n",
    "with open('negative_words.pickle', 'rb') as f:\n",
    "    negative_words = pickle.load(f)\n",
    "    negative_words = set(clean_text(\" \".join(negative_words)))\n",
    "    \n",
    "# check out the list\n",
    "# print(\"positive words: \", positive_words)\n",
    "# print(\"negative words: \", negative_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_sentiment(txt, wordlist):\n",
    "    # TO DO\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test your function\n",
    "positive_sentiments = np.array([ get_sentiment(c, positive_words) for c in corpus ])\n",
    "print(positive_sentiments)\n",
    "\n",
    "negative_sentiments = np.array([ get_sentiment(c, negative_words) for c in corpus ])\n",
    "print(negative_sentiments)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**before continuing part 2, go through the lesson material first!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"compute_idf\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2 Document Vector Exercises\n",
    "\n",
    "## 3. Computer idf\n",
    "Given a corpus, or a list of bag-of-words, we want to compute for each word $w$, the inverse-document-frequency, or ${\\rm idf}(w)$. This can be done in a few steps:\n",
    "\n",
    "1. Gather a set of all the words in all the bag-of-words (python set comes in handy, and the union operator `|` is useful here)\n",
    "\n",
    "\n",
    "2. Loop over each word $w$, and compute ${\\rm df}_w$, the number of documents where this word appears at least once. A dictionary is useful for keeping track of ${\\rm df}_w$\n",
    "\n",
    "\n",
    "3. After computing ${\\rm df}_w$, we can compute ${\\rm idf}(w)$. There are usually two possibilities, the simplest one is \n",
    "$${\\rm idf}(w)=\\frac{N}{{\\rm df}_w}$$\n",
    "Where $N$ is the total number of documents in the corpus and $df_w$ the number of documents that contain the word $w$. Frequently, a logarithm term is added as well\n",
    "$${\\rm idf}(w)=\\log\\frac{N}{{\\rm df}_w}$$\n",
    "One issue with using the logarithm is that when ${\\rm df}_w = N$, ${\\rm idf}(w)=0$, indicating that words common to all documents would be ignored. If we don't want this behavior, we can define ${\\rm idf}(w)=\\log\\left(1+N/{\\rm df}_w\\right)$ or ${\\rm idf}(w)=1+\\log\\left(N/{\\rm df}_w\\right)$ instead. For us, we'll not use the extra +1 for ${\\rm idf}$.\n",
    "\n",
    "In the following, define a function called `get_idf(corpus, include_log=True)` that computes ${\\rm idf}(w)$ for all the words in a corpus, where `corpus` for us is a processed list of bag-of-words (stemmed and lemmatized). The optional parameter `include_log` includes the logarithm in the computation.\n",
    "\n",
    "<a href=\"#table_of_content\">back to top</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compute idf\n",
    "def get_idf(corpus, include_log=True):\n",
    "    # TO DO\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should expect to see many idf values = 0! This is by design, because we have ${\\rm idf}(w)=\\log N_d/{\\rm df}_w$ and $N_d/{\\rm df}_w=1$ for the most common words!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# test your code\n",
    "idf=get_idf(corpus)\n",
    "print_sorted(idf, ascending=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"compute_tf\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Compute tf\n",
    "\n",
    "Below we will compute ${\\rm tf}(w,d)$, or the term frequency for a given word $w$, in a given document $d$. Since our ultimate goal is to compute a document vector, we'd like to keep a few things in mind\n",
    "\n",
    "1. Store the ${\\rm tf}(w, d)$ for each word in a document as a dictionary\n",
    "2. Even when a word doesn't appear in the document $d$, we still want to keep a $0$ entry in the dictionary. This is important when we convert the dictionary to a vector, where zero entries are important\n",
    "\n",
    "\n",
    "There are multiple definitions for ${\\rm tf}(w,d)$, the simplest one is\n",
    "\n",
    "$$\n",
    "{\\rm tf}(w,d)=\\frac{f_{w,d}}{a_d}\n",
    "$$\n",
    "\n",
    "where $f_{w,d}$ is the number of occurence of the word $w$ in the document $d$, and $a$ the average occurence of all the words in that document for normalization. Just like ${\\rm idf}(w)$, a logarithm can be added\n",
    "\n",
    "$$\n",
    "{\\rm tf}(w,d)=\\begin{cases}\n",
    "\\frac{1+\\log f_{w,d}}{1+\\log a_d} & f_{w,d} > 0 \\\\\n",
    "0 & f_{w,d} = 0 \\\\\n",
    "\\end{cases}\n",
    "$$\n",
    "\n",
    "Implement the function `get_df(txt, include_log=True)` that computes ${\\rm tf}(w,d)$ for each word in the document (returns a defaultdict(int), so that when supplying words not in the document the dictionary will yield zero instead of an error). Also include the optional parameter `include_log` that enables the additional logarithm term in the computation. I suggest also adding a function called `_tf` that computes the function above. \n",
    "\n",
    "<a href=\"#table_of_content\">back to top</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from math import *\n",
    "\n",
    "def _tf(freq, avg, include_log=True):\n",
    "    # TO DO\n",
    "    pass\n",
    "\n",
    "def get_tf(txt, include_log=True):\n",
    "    # TO DO\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfs = [ get_tf(c) for c in corpus ]\n",
    "print_sorted(tfs[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"doc_vector\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Document Vector\n",
    "Combine the implementation for ${\\rm tf}(w,d)$ and ${\\rm idf}(w)$ to compute a word-vector for each document in a corpus. Don't forget the zero-padding that is needed when a word appears in some document but not others. \n",
    "\n",
    "Implmenet the function `get_vectors(tf,idf)`, the input is the output computed by `get_tf` and `get_idf`\n",
    "\n",
    "(*optional challenge: implement in one line!*)\n",
    "\n",
    "<a href=\"#table_of_content\">back to top</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_vector(tf, idf):\n",
    "    # TO DO\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test your code\n",
    "doc_vectors = [ get_vector(tf, idf) for tf in tfs ]\n",
    "\n",
    "for v in doc_vectors:\n",
    "    print(v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id =\"similarity\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Similarity\n",
    "\n",
    "Given two word-vectors $\\vec u$ (or $u_i$) and $\\vec v$ (or $v_i$), corresponding to two documents, we want to compute different similarity metrics. \n",
    "\n",
    "1. Cosine similarity, defined by \n",
    "$$\n",
    "{\\rm Sim}_{\\cos} = \\frac{\\vec u \\cdot \\vec v}{|\\vec u| |\\vec v|}\n",
    "$$\n",
    "\n",
    "2. Jaccard similarity, defined by\n",
    "$$\n",
    "{\\rm Sim}_{\\rm Jaccard} = \\frac{\\sum_i \\min\\{u_i, v_i\\}}{\\sum_i \\max\\{u_i, v_i\\}}\n",
    "$$\n",
    "\n",
    "Implement the function similarity as `sim_cis(u,v)` and `sim_jac(u,v)`. Utilize the numpy functions `numpy.linalg.norm` to compute norm of a vector and `np.dot` for computing dot-products. `np.minimum` and `np.maximum` are also useful vectorized pair-wise minimum and maximum functions.\n",
    "\n",
    "(*optional challenge: implement both functions in one line!*)\n",
    "\n",
    "<a href=\"#table_of_content\">back to top</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.linalg import norm\n",
    "\n",
    "def sim_cos(u,v):\n",
    "    # TO DO\n",
    "    pass\n",
    "\n",
    "def sim_jac(u,v):\n",
    "    # TO DO\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# test your co\n",
    "# compute all the pairwise similarity metrics\n",
    "size = len(doc_vectors)\n",
    "matrix_cos = np.zeros((size,size))\n",
    "matrix_jac = np.zeros((size,size))\n",
    "\n",
    "for i in range(size):\n",
    "    for j in range(size):\n",
    "        u = doc_vectors[i]\n",
    "        v = doc_vectors[j]\n",
    "        matrix_cos[i][j] = sim_cos(u,v)\n",
    "        matrix_jac[i][j] = sim_jac(u,v)\n",
    "        \n",
    "print(\"Cosine Similarity:\")\n",
    "print(matrix_cos)\n",
    "\n",
    "print()\n",
    "print(\"Jaccard Similarity:\")\n",
    "print(matrix_jac)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Good Job! You've finished all the exercises!\n",
    "\n",
    "Here is an optional bonus challenge. We often need to debug other people's code to figure out what's wrong. It's particularly difficult when the code doesn't give errors but computes the wrong quantity. Can you describe why the following implementations for some of the exercises above may be wrong? Highlight the words at the bottom to reveal the solutions!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_idf_wrong(corpus, include_log=True):\n",
    "    freq = defaultdict(int)\n",
    "    for c in corpus:\n",
    "        for w in c:\n",
    "            freq[w] += 1\n",
    "        \n",
    "    N = len(corpus)\n",
    "    if include_log:\n",
    "        return { w:log(N/freq[w]) for w in freq }\n",
    "    else:\n",
    "        return { w:N/freq[w] for w in freq }\n",
    "\n",
    "\n",
    "def get_sentiment_wrong(txt, wordlist):\n",
    "    matching_words = [ w for w in wordlist if w in txt ]\n",
    "    return len(matching_words)/len(txt)\n",
    "\n",
    "def get_vectors_wrong(tf, idf):\n",
    "    return np.array([ tf[w]*idf[w] for w in tf ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solutions\n",
    "\n",
    "Drag your mouse over the white space below this cell, and you'll see details about the solutions.  Or, if it's easier, just double-click on the white space below this cell, which will reveal the cell with hidden text.  Also, please check out the file [bagofwords_solutions.py](bagofwords_solutions.py)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=\"white\">\n",
    "Solution\n",
    "\n",
    "get_idf: the defaultdict freq here computes the total number of occurences in all the document. We only want to count it once when a word appears in a document. This may lead to a document frequency larger than N, leading to a negative idf! \n",
    "\n",
    "get_sentiment_wrong: if a word w appears twice in the document, it won't be counted properly!\n",
    "\n",
    "get_vectors_wrong: tf may not contain all the words in idf. We need zero padding for words that appear in idf but not in tf! \n",
    "</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
